This is me working through relevant exercises in R for Data Science by Garrett Grolemund and Hadley Wickham.

Chapter 5 - Data Transformation

Filter rows with filter()

Examples

Load packages, data.

library(nycflights13)
library(tidyverse)

Select flights that happened on January 1, 2013.

jan1 <- filter(flights, month == 1, day == 1)
jan1

# alternatively, enclose in parens
(jan1 <- filter(flights, month == 1, day == 1))

When dealing with floating-point numbers (decimals), use near() rather than == for equality comparisons.

sqrt(2)^2 == 2
## [1] FALSE
near(sqrt(2)^2, 2)
## [1] TRUE

Instead of writing long disjunctions of == comparisons, use %in%.

nov_dec <- filter(flights, month == 11 | month == 12)
nov_dec <- filter(flights, month %in% c(11, 12))

De Morgan’s laws:

  • !(x & y) == (!x | !y)
  • !(x | y) == (!x & !y)
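
These identities can be checked directly in R by enumerating all combinations of two logicals:

```r
# All four combinations of two logical values
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

all(!(x & y) == (!x | !y))  # [1] TRUE
all(!(x | y) == (!x & !y))  # [1] TRUE
```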

filter() drops rows where the condition evaluates to NA; if you want to include such rows, you have to ask for them explicitly.

df <- tibble(x = c(1, NA, 3))
filter(df, is.na(x) | x > 1)
## # A tibble: 2 x 1
##       x
##   <dbl>
## 1    NA
## 2     3

Exercises

  1. Find all flights that:
    1. Had an arrival delay of two or more hours
delay <- filter(flights, arr_delay >= 120)
summary(delay$arr_delay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   120.0   137.0   163.0   184.2   207.0  1272.0
    2. Flew to Houston (IAH or HOU)
houston <- filter(flights, dest %in% c("IAH", "HOU"))
table(houston$dest)
## 
##  HOU  IAH 
## 2115 7198
    3. Were operated by United, American, or Delta
airlines_filt <- filter(flights, carrier %in% c("UA", "AA", "DL"))
table(airlines_filt$carrier)
## 
##    AA    DL    UA 
## 32729 48110 58665
    4. Departed in summer (July, August, and September)
summer <- filter(flights, month %in% c(7, 8, 9))
table(summer$month)
## 
##     7     8     9 
## 29425 29327 27574
    5. Arrived more than two hours late, but didn’t leave late
arrdelay <- filter(flights, dep_delay <= 0 & arr_delay > 120)
arrdelay
## # A tibble: 29 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1    27     1419           1420        -1     1754
##  2  2013    10     7     1350           1350         0     1736
##  3  2013    10     7     1357           1359        -2     1858
##  4  2013    10    16      657            700        -3     1258
##  5  2013    11     1      658            700        -2     1329
##  6  2013     3    18     1844           1847        -3       39
##  7  2013     4    17     1635           1640        -5     2049
##  8  2013     4    18      558            600        -2     1149
##  9  2013     4    18      655            700        -5     1213
## 10  2013     5    22     1827           1830        -3     2217
## # ... with 19 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
    6. Were delayed by at least an hour, but made up over 30 minutes in flight
makeup <- filter(flights, dep_delay >= 60 & (arr_delay < dep_delay - 30))
    7. Departed between midnight and 6am (inclusive)
morning <- filter(flights, dep_time >= 0 & dep_time <= 600)
summary(morning$dep_time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   540.0   554.0   488.7   557.0   600.0

Note that dep_time encodes midnight as 2400, so this filter misses flights that departed exactly at midnight; adding | dep_time == 2400 would include them.
  2. Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

between() allows you to select a range for a value (e.g., between 7 and 9). It is inclusive on both sides.

summer <- filter(flights, between(month, 7, 9))
morning <- filter(flights, between(dep_time, 0, 600))
  3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
missing_dep <- filter(flights, is.na(dep_time))
missing_dep
## # A tibble: 8,255 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1       NA           1630        NA       NA
##  2  2013     1     1       NA           1935        NA       NA
##  3  2013     1     1       NA           1500        NA       NA
##  4  2013     1     1       NA            600        NA       NA
##  5  2013     1     2       NA           1540        NA       NA
##  6  2013     1     2       NA           1620        NA       NA
##  7  2013     1     2       NA           1355        NA       NA
##  8  2013     1     2       NA           1420        NA       NA
##  9  2013     1     2       NA           1321        NA       NA
## 10  2013     1     2       NA           1545        NA       NA
## # ... with 8,245 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

These flights (n = 8255) are missing actual departure and arrival times, as well as air time and sometimes tail number. These are probably flights that were cancelled.

  4. Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

In each case the result doesn’t depend on what the missing value is: anything raised to the zeroth power is 1, anything OR TRUE is TRUE, and anything AND FALSE is FALSE. NA * 0 is trickier: if the NA stood for Inf or -Inf, then Inf * 0 would be NaN rather than 0, so R can’t determine the result. The general rule: an operation involving NA returns a non-missing value only when the answer would be the same for every value the NA could take.
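
A quick check of each case at the console:

```r
NA ^ 0      # [1] 1     -- x^0 is 1 for every x
NA | TRUE   # [1] TRUE  -- x | TRUE is TRUE for every x
FALSE & NA  # [1] FALSE -- FALSE & x is FALSE for every x
NA * 0      # [1] NA    -- because...
Inf * 0     # [1] NaN   -- ...if NA were Inf, the answer wouldn't be 0
```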

Arrange rows with arrange()

Examples

Passing multiple columns is like saying THEN: arrange by year, THEN by month, THEN by day.

arrange(flights, year, month, day)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

desc() does descending order.

arrange(flights, desc(dep_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900      1301     1242
##  2  2013     6    15     1432           1935      1137     1607
##  3  2013     1    10     1121           1635      1126     1239
##  4  2013     9    20     1139           1845      1014     1457
##  5  2013     7    22      845           1600      1005     1044
##  6  2013     4    10     1100           1900       960     1342
##  7  2013     3    17     2321            810       911      135
##  8  2013     6    27      959           1900       899     1236
##  9  2013     7    22     2257            759       898      121
## 10  2013    12     5      756           1700       896     1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Missing values are sorted to the end (as the RStudio viewer does it).

df <- tibble(x = c(5, 2, NA))
arrange(df, x)
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1     2
## 2     5
## 3    NA

Exercises

  1. How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).
df <- tibble(x = c(5, 2, NA))
arrange(df, desc(is.na(x)))
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1    NA
## 2     5
## 3     2
  2. Sort flights to find the most delayed flights. Find the flights that left earliest.
arrange(flights, desc(dep_delay), dep_time)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900      1301     1242
##  2  2013     6    15     1432           1935      1137     1607
##  3  2013     1    10     1121           1635      1126     1239
##  4  2013     9    20     1139           1845      1014     1457
##  5  2013     7    22      845           1600      1005     1044
##  6  2013     4    10     1100           1900       960     1342
##  7  2013     3    17     2321            810       911      135
##  8  2013     6    27      959           1900       899     1236
##  9  2013     7    22     2257            759       898      121
## 10  2013    12     5      756           1700       896     1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
  3. Sort flights to find the fastest flights.
arrange(flights, air_time)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1    16     1355           1315        40     1442
##  2  2013     4    13      537            527        10      622
##  3  2013    12     6      922            851        31     1021
##  4  2013     2     3     2153           2129        24     2247
##  5  2013     2     5     1303           1315       -12     1342
##  6  2013     2    12     2123           2130        -7     2211
##  7  2013     3     2     1450           1500       -10     1547
##  8  2013     3     8     2026           1935        51     2131
##  9  2013     3    18     1456           1329        87     1533
## 10  2013     3    19     2226           2145        41     2305
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Woooo Newark to Bradley – 20 minutes!

  4. Which flights travelled the longest? Which travelled the shortest?
arrange(flights, distance)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     7    27       NA            106        NA       NA
##  2  2013     1     3     2127           2129        -2     2222
##  3  2013     1     4     1240           1200        40     1333
##  4  2013     1     4     1829           1615       134     1937
##  5  2013     1     4     2128           2129        -1     2218
##  6  2013     1     5     1155           1200        -5     1241
##  7  2013     1     6     2125           2129        -4     2224
##  8  2013     1     7     2124           2129        -5     2212
##  9  2013     1     8     2127           2130        -3     2304
## 10  2013     1     9     2126           2129        -3     2217
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Lol, there’s a cancelled flight between Newark and LaGuardia. Otherwise it’s Newark to Philly as the shortest.

arrange(flights, desc(distance))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      857            900        -3     1516
##  2  2013     1     2      909            900         9     1525
##  3  2013     1     3      914            900        14     1504
##  4  2013     1     4      900            900         0     1516
##  5  2013     1     5      858            900        -2     1519
##  6  2013     1     6     1019            900        79     1558
##  7  2013     1     7     1042            900       102     1620
##  8  2013     1     8      901            900         1     1504
##  9  2013     1     9      641            900      1301     1242
## 10  2013     1    10      859            900        -1     1449
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Looks like it’s a flight between JFK and Honolulu, which is almost 5,000 miles. It’s only about 9 hours!

Select columns with select()

Examples

select() lets you pick out columns.

# Select columns by name
select(flights, year, month, day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1      517            515         2      830            819        11
##  2      533            529         4      850            830        20
##  3      542            540         2      923            850        33
##  4      544            545        -1     1004           1022       -18
##  5      554            600        -6      812            837       -25
##  6      554            558        -4      740            728        12
##  7      555            600        -5      913            854        19
##  8      557            600        -3      709            723       -14
##  9      557            600        -3      838            846        -8
## 10      558            600        -2      753            745         8
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Helper functions:

  • starts_with("abc"): matches names that begin with “abc”.
  • ends_with("xyz"): matches names that end with “xyz”.
  • contains("ijk"): matches names that contain “ijk”.
  • matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters.
  • num_range("x", 1:3): matches x1, x2 and x3.
  • everything(): selects columns you haven’t mentioned yet.
# moves time_hour and air_time to the front of the tibble
select(flights, time_hour, air_time, everything())
## # A tibble: 336,776 x 19
##    time_hour           air_time  year month   day dep_time sched_dep_time
##    <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
##  1 2013-01-01 05:00:00      227  2013     1     1      517            515
##  2 2013-01-01 05:00:00      227  2013     1     1      533            529
##  3 2013-01-01 05:00:00      160  2013     1     1      542            540
##  4 2013-01-01 05:00:00      183  2013     1     1      544            545
##  5 2013-01-01 06:00:00      116  2013     1     1      554            600
##  6 2013-01-01 05:00:00      150  2013     1     1      554            558
##  7 2013-01-01 06:00:00      158  2013     1     1      555            600
##  8 2013-01-01 06:00:00       53  2013     1     1      557            600
##  9 2013-01-01 06:00:00      140  2013     1     1      557            600
## 10 2013-01-01 06:00:00      138  2013     1     1      558            600
## # ... with 336,766 more rows, and 12 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
## #   hour <dbl>, minute <dbl>

rename() is useful to rename columns.

rename(flights, tail_num = tailnum)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tail_num <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Exercises

  1. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
select(flights, dep_time, dep_delay, arr_time, arr_delay)
## # A tibble: 336,776 x 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ... with 336,766 more rows
select(flights, starts_with("dep"), starts_with("arr"))
## # A tibble: 336,776 x 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ... with 336,766 more rows
select(flights, c(4,6,7,9))
## # A tibble: 336,776 x 4
##    dep_time dep_delay arr_time arr_delay
##       <int>     <dbl>    <int>     <dbl>
##  1      517         2      830        11
##  2      533         4      850        20
##  3      542         2      923        33
##  4      544        -1     1004       -18
##  5      554        -6      812       -25
##  6      554        -4      740        12
##  7      555        -5      913        19
##  8      557        -3      709       -14
##  9      557        -3      838        -8
## 10      558        -2      753         8
## # ... with 336,766 more rows
  2. What happens if you include the name of a variable multiple times in a select() call?
select(flights, year, year, day, year)
## # A tibble: 336,776 x 2
##     year   day
##    <int> <int>
##  1  2013     1
##  2  2013     1
##  3  2013     1
##  4  2013     1
##  5  2013     1
##  6  2013     1
##  7  2013     1
##  8  2013     1
##  9  2013     1
## 10  2013     1
## # ... with 336,766 more rows

The variable is only returned once; the duplicates are ignored.

  3. What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
## # A tibble: 336,776 x 5
##     year month   day dep_delay arr_delay
##    <int> <int> <int>     <dbl>     <dbl>
##  1  2013     1     1         2        11
##  2  2013     1     1         4        20
##  3  2013     1     1         2        33
##  4  2013     1     1        -1       -18
##  5  2013     1     1        -6       -25
##  6  2013     1     1        -4        12
##  7  2013     1     1        -5        19
##  8  2013     1     1        -3       -14
##  9  2013     1     1        -3        -8
## 10  2013     1     1        -2         8
## # ... with 336,766 more rows

It’s kind of like %in% for select(): if a column name matches one of the names in the vector, that column is returned.

  4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
## # A tibble: 336,776 x 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time
##       <int>          <int>    <int>          <int>    <dbl>
##  1      517            515      830            819      227
##  2      533            529      850            830      227
##  3      542            540      923            850      160
##  4      544            545     1004           1022      183
##  5      554            600      812            837      116
##  6      554            558      740            728      150
##  7      555            600      913            854      158
##  8      557            600      709            723       53
##  9      557            600      838            846      140
## 10      558            600      753            745      138
## # ... with 336,766 more rows, and 1 more variable: time_hour <dttm>

By default, the select() helpers ignore case. You can change that with ignore.case = FALSE.

select(flights, contains("TIME", ignore.case = FALSE))
## # A tibble: 336,776 x 0

Add new variables with mutate()

Examples

mutate() makes new variables and adds them to the end of the tibble. You can refer to columns you’ve just created to build further variables within the same call.

flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
## # A tibble: 336,776 x 11
##     year month   day dep_delay arr_delay distance air_time  gain speed
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227    -9  370.
##  2  2013     1     1         4        20     1416      227   -16  374.
##  3  2013     1     1         2        33     1089      160   -31  408.
##  4  2013     1     1        -1       -18     1576      183    17  517.
##  5  2013     1     1        -6       -25      762      116    19  394.
##  6  2013     1     1        -4        12      719      150   -16  288.
##  7  2013     1     1        -5        19     1065      158   -24  404.
##  8  2013     1     1        -3       -14      229       53    11  259.
##  9  2013     1     1        -3        -8      944      140     5  405.
## 10  2013     1     1        -2         8      733      138   -10  319.
## # ... with 336,766 more rows, and 2 more variables: hours <dbl>,
## #   gain_per_hour <dbl>

If you only want to keep the new variables, use transmute().

transmute(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)
## # A tibble: 336,776 x 4
##     gain speed hours gain_per_hour
##    <dbl> <dbl> <dbl>         <dbl>
##  1    -9  370. 3.78          -2.38
##  2   -16  374. 3.78          -4.23
##  3   -31  408. 2.67         -11.6 
##  4    17  517. 3.05           5.57
##  5    19  394. 1.93           9.83
##  6   -16  288. 2.5           -6.4 
##  7   -24  404. 2.63          -9.11
##  8    11  259. 0.883         12.5 
##  9     5  405. 2.33           2.14
## 10   -10  319. 2.3           -4.35
## # ... with 336,766 more rows

You can use any function with mutate() as long as it is vectorized: it must take a vector of values as input and return a vector with the same number of values as output.
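
For example, a hand-written helper works inside mutate() as long as it operates elementwise on whole vectors (a sketch; km_per_min is my own hypothetical helper, using 1.609 km per mile):

```r
# Vectorized: a vector of distances and a vector of times in,
# one vector of the same length out
km_per_min <- function(distance_miles, air_time_min) {
  distance_miles * 1.609 / air_time_min
}

round(km_per_min(c(1400, 229), c(227, 53)), 2)  # [1] 9.92 6.95
```

Something like mutate(flights, speed_km = km_per_min(distance, air_time)) would then add the new column in one step.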

Useful functions:

  • Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y)
  • Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)).
  • Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means.
  • Ranking: there are a number of ranking functions, but you should start with min_rank(). If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), and ntile().
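
A few of these on a small vector (base R, except dplyr’s lag()):

```r
x <- c(1, 3, 2, 5, 4)

x %/% 2                            # [1] 0 1 1 2 2
x %% 2                             # [1] 1 1 0 1 0
all(x == 2 * (x %/% 2) + (x %% 2)) # [1] TRUE -- the identity above
cumsum(x)                          # [1] 1 4 6 11 15
cummax(x)                          # [1] 1 3 3 5 5
x - dplyr::lag(x)                  # [1] NA 2 -1 3 -1
```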

Exercises

  1. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
flights_sml <- select(flights, contains("dep"))
mutate(flights_sml,
       dep_min_since_midn = (dep_time %/% 100) * 60 + (dep_time %% 100),
       sched_dep_min_since_midn = (sched_dep_time %/% 100) * 60 + (sched_dep_time %% 100))
## # A tibble: 336,776 x 5
##    dep_time sched_dep_time dep_delay dep_min_since_midn sched_dep_min_sin…
##       <int>          <int>     <dbl>              <dbl>              <dbl>
##  1      517            515         2                317                315
##  2      533            529         4                333                329
##  3      542            540         2                342                340
##  4      544            545        -1                344                345
##  5      554            600        -6                354                360
##  6      554            558        -4                354                358
##  7      555            600        -5                355                360
##  8      557            600        -3                357                360
##  9      557            600        -3                357                360
## 10      558            600        -2                358                360
## # ... with 336,766 more rows
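
Since this HHMM-to-minutes conversion keeps reappearing, it could be wrapped in a small helper (hhmm_to_min is my own name for it):

```r
# Convert an HHMM-encoded integer (e.g. 517 means 5:17) to
# minutes since midnight
hhmm_to_min <- function(t) {
  (t %/% 100) * 60 + (t %% 100)
}

hhmm_to_min(c(517, 600, 2359))  # [1] 317 360 1439
```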
  2. Compare air_time with arr_time - dep_time. What do you expect to see? What do you see? What do you need to do to fix it?
mutate(select(flights, arr_time, dep_time, air_time),
       tot_time = arr_time - dep_time)
## # A tibble: 336,776 x 4
##    arr_time dep_time air_time tot_time
##       <int>    <int>    <dbl>    <int>
##  1      830      517      227      313
##  2      850      533      227      317
##  3      923      542      160      381
##  4     1004      544      183      460
##  5      812      554      116      258
##  6      740      554      150      186
##  7      913      555      158      358
##  8      709      557       53      152
##  9      838      557      140      281
## 10      753      558      138      195
## # ... with 336,766 more rows

I think I have to change arr_time and dep_time to continuous values.

mutate(select(flights, dep_time, arr_time, air_time),
       dep_cont = (dep_time %/% 100)* 60 + (dep_time %% 100),
       arr_cont = (arr_time %/% 100) * 60 + (arr_time %% 100),
       tot_time = arr_cont - dep_cont)
## # A tibble: 336,776 x 6
##    dep_time arr_time air_time dep_cont arr_cont tot_time
##       <int>    <int>    <dbl>    <dbl>    <dbl>    <dbl>
##  1      517      830      227      317      510      193
##  2      533      850      227      333      530      197
##  3      542      923      160      342      563      221
##  4      544     1004      183      344      604      260
##  5      554      812      116      354      492      138
##  6      554      740      150      354      460      106
##  7      555      913      158      355      553      198
##  8      557      709       53      357      429       72
##  9      557      838      140      357      518      161
## 10      558      753      138      358      473      115
## # ... with 336,766 more rows

This does not deal with flights that take off before midnight one night and land the next day. I would probably have to filter for those rows and do something different (maybe using lubridate?).
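
Short of pulling in dates, modular arithmetic can patch this up: taking the difference mod 1440 (the minutes in a day) wraps arrivals that land “before” their departure around to the next day. A sketch, assuming no flight lasts 24 hours or more (elapsed_min is my own helper name):

```r
# Elapsed minutes between two minutes-since-midnight values,
# allowing the arrival to fall on the following day
elapsed_min <- function(dep_cont, arr_cont) {
  (arr_cont - dep_cont) %% 1440
}

elapsed_min(317, 510)  # [1] 193 -- 05:17 to 08:30
elapsed_min(1380, 90)  # [1] 150 -- 23:00 to 01:30 the next day
```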

  3. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
dep <- select(flights, dep_time, sched_dep_time, dep_delay)
mutate(dep,
       dep_cont = (dep_time %/% 100)* 60 + (dep_time %% 100),
       sched_dep_cont = (sched_dep_time %/% 100)* 60 + (sched_dep_time %% 100),
       calc_dep_delay = dep_cont - sched_dep_cont)
## # A tibble: 336,776 x 6
##    dep_time sched_dep_time dep_delay dep_cont sched_dep_cont
##       <int>          <int>     <dbl>    <dbl>          <dbl>
##  1      517            515         2      317            315
##  2      533            529         4      333            329
##  3      542            540         2      342            340
##  4      544            545        -1      344            345
##  5      554            600        -6      354            360
##  6      554            558        -4      354            358
##  7      555            600        -5      355            360
##  8      557            600        -3      357            360
##  9      557            600        -3      357            360
## 10      558            600        -2      358            360
## # ... with 336,766 more rows, and 1 more variable: calc_dep_delay <dbl>

dep_delay is the difference between the continuous versions of dep_time and sched_dep_time.

  4. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank().
delay <- select(flights, arr_delay)
arrange(mutate(delay,
               delay_rank = min_rank(arr_delay)), desc(delay_rank))
## # A tibble: 336,776 x 2
##    arr_delay delay_rank
##        <dbl>      <int>
##  1      1272     327346
##  2      1127     327345
##  3      1109     327344
##  4      1007     327343
##  5       989     327342
##  6       931     327341
##  7       915     327340
##  8       895     327339
##  9       878     327338
## 10       875     327337
## # ... with 336,766 more rows
  5. What does 1:3 + 1:10 return? Why?
1:3 + 1:10
##  [1]  2  4  6  5  7  9  8 10 12 11

It returns a result along with a warning that the longer object length is not a multiple of the shorter object length. R recycles the shorter vector to match the longer one, so 1:3 becomes 1, 2, 3, 1, 2, 3, 1, 2, 3, 1 before the addition; the warning fires because 10 is not a multiple of 3.
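
When the longer length is a multiple of the shorter, the recycling happens silently:

```r
# 1:2 is recycled to 1 2 1 2 1 2 1 2 1 2 -- no warning
1:2 + 1:10
##  [1]  2  4  4  6  6  8  8 10 10 12
```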

  6. What trigonometric functions does R provide?

R has the following:

cos(x)
sin(x)
tan(x)

acos(x)
asin(x)
atan(x)
atan2(y, x)

# cospi(x), sinpi(x), and tanpi(x), compute cos(pi*x), sin(pi*x), and tan(pi*x)
cospi(x)
sinpi(x)
tanpi(x)

Grouped summaries with summarise()

Examples

summarise() collapses a data frame to a single row. It’s especially useful for grouped summaries.

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 365 x 4
## # Groups:   year, month [?]
##     year month   day delay
##    <int> <int> <int> <dbl>
##  1  2013     1     1 11.5 
##  2  2013     1     2 13.9 
##  3  2013     1     3 11.0 
##  4  2013     1     4  8.95
##  5  2013     1     5  5.73
##  6  2013     1     6  7.15
##  7  2013     1     7  5.42
##  8  2013     1     8  2.55
##  9  2013     1     9  2.28
## 10  2013     1    10  2.84
## # ... with 355 more rows

To make this code easier, we can use the pipe %>%.

# by destination, count the number of flights, mean distance, and mean arrival delay.
## only use locations with more than 20 flights that are not HNL.
delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")

For most aggregation functions, if there is any missing data the result will be NA. To avoid this, use na.rm = TRUE.

Also, it’s useful to always include a count n(), or a count of non-missing values sum(!is.na(x)), when looking at aggregated data, to make sure you’re not drawing conclusions from a small amount of data.
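A minimal illustration of both points on a toy vector:

```r
x <- c(4, 8, NA, 15)

mean(x)                # NA propagates through the aggregation
## [1] NA
mean(x, na.rm = TRUE)  # drop missing values first
## [1] 9
sum(!is.na(x))         # how many values the mean is actually based on
## [1] 3
```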

Useful summary functions:

  • Measures of location: mean(x), median(x)
  • Measures of spread: sd(x), IQR(x), mad(x) (last two robust to outliers)
  • Measures of rank: min(x), quantile(x, 0.25), max(x)
  • Measures of position: first(x), nth(x, 2), last(x)
  • Counts: n(), sum(!is.na(x)), n_distinct(x) (unique values)
  • Counts and proportions of logical values: sum(x > 10), mean(y == 0). Here, sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.
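The last bullet works because TRUE/FALSE are coerced to 1/0 in arithmetic, for example:

```r
x <- c(5, 12, 8, 30)

sum(x > 10)   # number of values above 10
## [1] 2
mean(x > 10)  # proportion of values above 10
## [1] 0.5
```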

There’s also a special function for counts that takes a weighting variable if needed.

not_cancelled <- flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))
# sums total number of miles a given plane flew
not_cancelled %>% 
  count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
##    tailnum      n
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ... with 4,027 more rows

ungroup() reverses group_by().
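A small sketch with a toy tibble (hypothetical data) showing that ungroup() makes subsequent summaries operate on the whole table again:

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b"), x = c(1, 5, 9))

df %>% group_by(g) %>% summarise(m = mean(x))                # one row per group: a = 3, b = 9
df %>% group_by(g) %>% ungroup() %>% summarise(m = mean(x))  # one row overall: 5
```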

Exercises

  1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
  • A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
  • A flight is always 10 minutes late.
  • A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
  • 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
  • Which is more important: arrival delay or departure delay?

For the 50/50 flights, if we just take the mean of the delay we will get 0, which implies that the flight is basically always on time – but that’s not true. Instead we probably want to have some kind of measure of the variability in delay time as well as the mean delay time.

not_cancelled %>%
  group_by(carrier, flight) %>%
  summarise(count = n(),
            m_delay = mean(arr_delay, na.rm = TRUE),
            sd_delay = sd(arr_delay, na.rm = TRUE)) %>%
  filter(count > 25) %>%
  ggplot(mapping = aes(x = m_delay, y = sd_delay)) +
    geom_point(alpha = 0.5) + theme_bw()

Here we can see that the most consistent flights are also the ones that tend to arrive early. We can do the same thing with dep_delay.

not_cancelled %>%
  group_by(carrier, flight) %>%
  summarise(count = n(),
            m_delay = mean(dep_delay, na.rm = TRUE),
            sd_delay = sd(dep_delay, na.rm = TRUE)) %>%
  filter(count > 25) %>%
  ggplot(mapping = aes(x = m_delay, y = sd_delay)) +
    geom_point(alpha = 0.5) + theme_bw()

  1. Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).
# number of flights by destination
not_cancelled %>%
  group_by(dest) %>%
  summarise(count = n())
## # A tibble: 104 x 2
##    dest  count
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # ... with 94 more rows
# total number of miles by plane
not_cancelled %>%
  group_by(tailnum) %>%
  summarise(tot_mi = sum(distance, na.rm = TRUE))
## # A tibble: 4,037 x 2
##    tailnum tot_mi
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ... with 4,027 more rows
  1. Our definition of cancelled flights (is.na(dep_delay) | is.na(arr_delay)) is slightly suboptimal. Why? Which is the most important column?

I think in this case the arr_delay column is more important, because flights can be delayed multiple times before they are actually cancelled. dep_time might just be the latest time the flight was expected to take off before it was cancelled. If arr_delay or arr_time is missing, this implies that the flight never landed at its destination. This could mean one of a few things:

  • the flight took off and never landed (a la Malaysian Airlines)
  • the flight took off and landed in a different airport than scheduled
  • the flight never actually took off.

In the case of the second option, we don’t have information about where a diverted flight actually did land, so for our purposes those flights are as good as cancelled. So, it seems like the best option here is to only use flights that have complete data for arr_delay.

  1. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
flights %>%
  group_by(month, day) %>%
  summarise(count = n(),
            cancelled = sum(is.na(arr_delay)),
            mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
    ggplot(mapping = aes(x = cancelled, y = mean_delay)) +
    geom_point(alpha = 0.5) + theme_bw() + geom_smooth(se = FALSE)

Intuitively, it would make sense that days with worse weather would have both more cancelled flights and more delayed flights. It seems like it’s a U-shaped curve – likely, once the weather gets so bad, all the flights get cancelled rather than some delayed and some cancelled.

  1. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))
flights %>%
  group_by(carrier, dest) %>%
  summarise(count = n(),
            mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay))
## # A tibble: 314 x 4
## # Groups:   carrier [16]
##    carrier dest  count mean_delay
##    <chr>   <chr> <int>      <dbl>
##  1 UA      STL       2      110  
##  2 OO      ORD       1      107  
##  3 OO      DTW       2       68.5
##  4 UA      RDU       1       56  
##  5 EV      CAE     113       42.8
##  6 EV      TYS     323       41.2
##  7 EV      PBI       6       40.7
##  8 EV      TUL     315       33.7
##  9 EV      OKC     346       30.6
## 10 UA      JAC      23       29.9
## # ... with 304 more rows
flights %>%
  group_by(dest) %>%
  summarise(count = n(),
            mean_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay))
## # A tibble: 105 x 3
##    dest  count mean_delay
##    <chr> <int>      <dbl>
##  1 CAE     116       41.8
##  2 TUL     315       33.7
##  3 OKC     346       30.6
##  4 JAC      25       28.1
##  5 TYS     631       24.1
##  6 MSN     572       20.2
##  7 RIC    2454       20.1
##  8 CAK     864       19.7
##  9 DSM     569       19.0
## 10 GRR     765       18.2
## # ... with 95 more rows

Looks like ExpressJet Airlines Inc. has pretty bad delays at lots of airports – CAE, TYS, PBI, TUL, OKC. United and SkyWest are also not great.

  1. What does the sort argument to count() do? When might you use it?

The sort argument sorts the output in descending order of the count. This is helpful if you want the most common groups at the head of a table without a separate call to arrange().
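For example, with a toy table (hypothetical data):

```r
library(dplyr)

df <- tibble(dest = c("ATL", "ORD", "ATL", "LAX", "ATL", "ORD"))

count(df, dest)               # ordered by dest
count(df, dest, sort = TRUE)  # ordered by n, descending: ATL 3, ORD 2, LAX 1
```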

Grouped mutates (and filters)

Exercises

  1. Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.

  2. Which plane (tailnum) has the worst on-time record?

flights  %>% 
  mutate(canceled = is.na(arr_time),
         late = !canceled & arr_delay > 0) %>%
  filter(canceled == FALSE) %>%
  group_by(tailnum) %>%
  summarise(count = n(),
            n_delay = sum(late, na.rm = TRUE),
            prop_delay = n_delay/count) %>%
  mutate(prop_rank = min_rank(desc(prop_delay))) %>%
  filter(count > 5) %>%
  arrange(prop_rank)
## # A tibble: 3,575 x 5
##    tailnum count n_delay prop_delay prop_rank
##    <chr>   <int>   <int>      <dbl>     <int>
##  1 N337AT     13      12      0.923       105
##  2 N169AT     11      10      0.909       106
##  3 N168AT     18      16      0.889       107
##  4 N261AT      9       8      0.889       107
##  5 N284AT      9       8      0.889       107
##  6 N290AT     16      14      0.875       110
##  7 N551NW      7       6      0.857       111
##  8 N273AT     13      11      0.846       112
##  9 N299AT      6       5      0.833       113
## 10 N388SW      6       5      0.833       113
## # ... with 3,565 more rows

This table summarizes the planes with the highest percentage of delayed flights (excl. cancelled flights). I’ve only included planes that have flown more than 5 times, since many planes have flown only once and were delayed. This shows that plane N337AT was delayed on 12 of its 13 flights in 2013.

  1. What time of day should you fly if you want to avoid delays as much as possible?
flights %>%
  group_by(hour) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot(aes(x = hour, y = avg_delay)) + geom_bar(stat = "identity") + theme_bw()

In this question I’m using dep_delay instead of arr_delay because it seems like you would want to avoid delays at the gate when choosing flight time. It looks like early morning (5am) is the best time – average delay is less than 1 minute! This makes sense, because delays accumulate throughout the day as flights come in late and then leave late.

  1. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.
flights %>%
  filter(!is.na(arr_delay), arr_delay > 0) %>%
  group_by(dest) %>%
  mutate(total_delay = sum(arr_delay),
            prop_delay = arr_delay / total_delay) %>%
  select(flight, dest, total_delay, prop_delay)
## # A tibble: 133,004 x 4
## # Groups:   dest [103]
##    flight dest  total_delay prop_delay
##     <int> <chr>       <dbl>      <dbl>
##  1   1545 IAH         99391  0.000111 
##  2   1714 IAH         99391  0.000201 
##  3   1141 MIA        140424  0.000235 
##  4   1696 ORD        283046  0.0000424
##  5    507 FLL        202605  0.0000938
##  6    301 ORD        283046  0.0000283
##  7    194 LAX        203226  0.0000344
##  8    707 DFW        110009  0.000282 
##  9   4650 ATL        300299  0.0000400
## 10   4401 DTW        138258  0.000116 
## # ... with 132,994 more rows
  1. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using lag(), explore how the delay of a flight is related to the delay of the immediately preceding flight.
flights %>%
  arrange(origin, month, day, dep_time) %>% # arrange in order
  group_by(origin) %>% # group by airport to avoid cross-airport comparisons
  mutate(prev_delay = lag(dep_delay)) %>% # create lagged delay column
  group_by(dep_delay) %>% # group by the current flight's delay so we can average the previous flight's delay at each value
  summarise(m_prev_delay = mean(prev_delay, na.rm = TRUE)) %>% # find mean delay for previous flight
  ggplot(aes(dep_delay, m_prev_delay)) + geom_point() + theme_bw() + xlab("Current Flight Delay") + ylab("Prev Flight Delay")

Here we can see that there is a pretty strong relationship between the previous flight’s delay and the current flight’s delay until we get to about a 5 hour delay, at which point anything goes.

  1. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
flights %>%
  group_by(origin, dest) %>%
  mutate(med_air_time = median(air_time, na.rm = TRUE)) %>%
  select(month, day, flight, origin, dest, air_time, med_air_time) %>%
  mutate(diff = air_time - med_air_time) %>%
  arrange(diff)
## # A tibble: 336,776 x 8
## # Groups:   origin, dest [224]
##    month   day flight origin dest  air_time med_air_time  diff
##    <int> <int>  <int> <chr>  <chr>    <dbl>        <dbl> <dbl>
##  1     7     2   4667 EWR    MSP         93          149   -56
##  2     7    14    673 EWR    SNA        274          329   -55
##  3     9     6    771 JFK    LAX        275          329   -54
##  4     7     3    161 JFK    SEA        275          328   -53
##  5     7    12     21 JFK    LAX        277          329   -52
##  6     7    13    485 EWR    SNA        279          329   -50
##  7     7    15    329 EWR    SNA        279          329   -50
##  8     8    27    797 JFK    LAX        279          329   -50
##  9     7    15    712 JFK    LAX        280          329   -49
## 10     9    28     15 EWR    HNL        562          611   -49
## # ... with 336,766 more rows

If we compare the air time for each flight to the median air time for that origin-destination pair, we see that the fastest flights are only about an hour quicker than a normal flight between those airports. This doesn’t necessarily indicate an error in data entry. You can note that most of these quick flights happen in July – maybe when the weather is pretty clear.

flights %>%
  group_by(origin, dest) %>%
  mutate(med_air_time = median(air_time, na.rm = TRUE)) %>%
  select(month, day, flight, origin, dest, air_time, med_air_time) %>%
  mutate(diff = air_time - med_air_time) %>%
  arrange(desc(diff))
## # A tibble: 336,776 x 8
## # Groups:   origin, dest [224]
##    month   day flight origin dest  air_time med_air_time  diff
##    <int> <int>  <int> <chr>  <chr>    <dbl>        <dbl> <dbl>
##  1     7    28    841 JFK    SFO        490          347   143
##  2     1    28    575 JFK    EGE        382          255   127
##  3    11    22    426 JFK    LAX        440          329   111
##  4     9    10    745 LGA    DEN        331          227   104
##  5    11    22    587 EWR    LAS        399          298   101
##  6     6    29   1491 JFK    ACK        141           41   100
##  7     1     2   4204 EWR    OKC        286          190    96
##  8     1    15   4204 EWR    OKC        284          190    94
##  9     7    10     17 JFK    LAX        422          329    93
## 10    12     6    434 JFK    SFO        438          347    91
## # ... with 336,766 more rows

If we look at the opposite, we can see some flights that were in the air a lot longer than average. For example, the flight on 7/28 from JFK to SFO was in the air over 2 hours more than average.

  1. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
airports_carriers <- flights %>%
  group_by(dest) %>%
  summarise(n_carriers = n_distinct(carrier)) %>%
  filter(n_carriers >= 2) %>% # the exercise asks for at least two carriers
  arrange(n_carriers)

flights %>%
  filter(dest %in% airports_carriers$dest) %>%
  group_by(carrier) %>%
  summarise(n_dest = n_distinct(dest)) %>%
  arrange(desc(n_dest))
## # A tibble: 15 x 2
##    carrier n_dest
##    <chr>    <int>
##  1 DL          37
##  2 EV          36
##  3 UA          36
##  4 9E          35
##  5 B6          30
##  6 AA          17
##  7 MQ          17
##  8 WN           9
##  9 OO           5
## 10 US           5
## 11 VX           3
## 12 YV           3
## 13 FL           2
## 14 AS           1
## 15 F9           1

It looks like Delta flies to the most multi-carrier airports, followed by ExpressJet and United.

  1. For each plane, count the number of flights before the first delay of greater than 1 hour.
flights %>%
  filter(!is.na(air_time)) %>%
  arrange(tailnum, month, day, dep_time) %>%
  mutate(one_hour = ifelse(dep_delay > 60, 1,0)) %>%
  group_by(tailnum) %>%
  mutate(cumsum = cumsum(one_hour)) %>%
  filter(cumsum == 0) %>%
  summarise(count = n())
## # A tibble: 3,804 x 2
##    tailnum count
##    <chr>   <int>
##  1 N0EGMQ     53
##  2 N10156      9
##  3 N102UW     25
##  4 N103US     46
##  5 N104UW      3
##  6 N105UW     22
##  7 N107US     20
##  8 N108UW     36
##  9 N109UW     48
## 10 N110UW     15
## # ... with 3,794 more rows

Chapter 7 - Exploratory Data Analysis

Two main questions for EDA:

  1. What type of variation occurs within my variables?

  2. What type of covariation occurs between my variables?

Variation

Variation is how a variable changes between measurements.

Categorical variables can be visualized using a bar chart.

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut)) + theme_bw()

Continuous variables can be visualized using a histogram.

ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5) + theme_bw()

To view multiple distributions of continuous variables at once, use geom_freqpoly().

smaller <- diamonds %>% 
  filter(carat < 3)

ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1) + theme_bw()

Things to look out for:

  • Typical values: Why are some values more common than others? Why are some values non-existent in the dataset? Why do some values seem to cluster together?
  • Unusual values: What outliers exist? Does your analysis change if you exclude them?

Exercises

  1. Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
library(gridExtra)
x <- ggplot(diamonds, aes(x)) + geom_histogram() + theme_bw()
y <- ggplot(diamonds, aes(y)) + geom_histogram() + theme_bw()
z <- ggplot(diamonds, aes(z)) + geom_histogram() + theme_bw()
grid.arrange(x,y,z,ncol=1)

The documentation says that x is length, y is width, and z is depth. This suggests that most of the diamonds are not cut in round or square shapes where the length and width would be very similar. More information about the cut shape of the diamonds would help us determine this without the documentation.

  1. Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
ggplot(diamonds, aes(price)) + geom_histogram() + theme_bw()

Nothing super weird about this plot right now.

ggplot(diamonds, aes(price)) + geom_histogram(binwidth = 10) + theme_bw()

Making the bin width smaller shows us a weird gap somewhere between 0 and 2500 dollars. Let’s look at that.

lessthanthreek <- filter(diamonds, price < 3000)
ggplot(lessthanthreek, aes(price)) + geom_histogram(binwidth = 10) + theme_bw()

There’s a very specific range of prices around $1500 where there are no cases. That’s weird!

  1. How many diamonds are 0.99 carat? How many are 1 carat? What do you think is the cause of the difference?
diamonds %>%
  filter(carat == 0.99 | carat == 1) %>%
  group_by(carat) %>%
  summarise(count = n())
## # A tibble: 2 x 2
##   carat count
##   <dbl> <int>
## 1  0.99    23
## 2  1     1558

There are a lot more diamonds that are 1 carat than those that are 0.99. This is probably because people want to buy a 1-carat diamond (as opposed to a 0.99) – jewelers might be rounding up from 0.99 to 1.

  1. Compare and contrast coord_cartesian() vs xlim() or ylim() when zooming in on a histogram. What happens if you leave binwidth unset? What happens if you try and zoom so only half a bar shows?
ggplot(diamonds, aes(price)) + geom_histogram() + theme_bw() + coord_cartesian(xlim = c(200, 1000))

ggplot(diamonds, aes(price)) + geom_histogram() + theme_bw() + xlim(200, 1000)

It seems like coord_cartesian is simply zooming in on the larger plot – the data extends past the last tick mark, and there is only one bin (because the 30 total bins come from the entire dataset). However, xlim (and presumably ylim) subsets the data to those that fit the limits defined. So, our data between 200 and 1000 is binned into 30 pieces using xlim.

Missing data

Sometimes when we have outlier/weird/obviously-an-error data points, we will set those values to NA rather than drop the entire case. However, we have to think about how R is going to deal with missing data after that point.

Exercises

  1. What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
mis_diamonds <- diamonds %>%
  mutate(price_miss = ifelse(price > 750 & price < 1900, NA, price))

ggplot(mis_diamonds, aes(price_miss)) + geom_histogram(binwidth = 10) + theme_bw()

In a histogram, the missing values are just removed.

mis_diamonds <- diamonds %>%
  mutate(color_miss = ifelse(color == "H", NA, as.character(color)))

ggplot(mis_diamonds, aes(as.factor(color_miss))) + geom_bar() + xlab("Color") + theme_bw()

In a bar chart, they’re plotted as NA, so you can see how many missing values you have.

  1. What does na.rm = TRUE do in mean() and sum()?

For both of these functions, na.rm = TRUE removes NA values before doing the calculation.

Covariation

Covariation is the tendency for the values of two or more variables to vary together in a related way. We’ll look at this by checking out some plots.

Boxplots are good for looking at the relationship between one categorical and one continuous variable.

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) + theme_bw() + xlab("car type")

For two categorical variables, we can make one of those cool grids.

diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))

When we are dealing with two continuous variables, we default to geom_point(). But it’s hard to read scatterplots when we have huge datasets. Other ways we can plot are with geom_bin2d() or geom_hex(). We can also bin a continuous variable and look at it in some boxplots.

ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price))

library(hexbin)
ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))

ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

Exercises

  1. Use what you’ve learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.
flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(cancelled, sched_dep_time)) + geom_boxplot() + theme_bw()

Generally, cancelled flights were scheduled to leave later than non-cancelled flights

  1. What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

I would expect that the variable most related to price is carat. Let’s look at how it relates to cut.

ggplot(diamonds, aes(cut, carat)) + geom_boxplot() + theme_bw()

It looks like the lower quality cuts tend to contain the biggest diamonds. Since carat drives price, that is probably why the lower quality diamonds can be the most expensive.

  1. Install the ggstance package, and create a horizontal boxplot. How does this compare to using coord_flip()?
library(ggstance)
ggplot(diamonds, aes(carat, cut)) + geom_boxploth() + theme_bw()

ggplot(diamonds, aes(cut, carat)) + geom_boxplot() + theme_bw() + coord_flip()

It looks like ggstance allows you to list the aesthetic mappings in the opposite way.

  1. One problem with boxplots is that they were developed in an era of much smaller datasets and tend to display a prohibitively large number of “outlying values”. One approach to remedy this problem is the letter value plot. Install the lvplot package, and try using geom_lv() to display the distribution of price vs cut. What do you learn? How do you interpret the plots?
library(lvplot)
ggplot(diamonds, aes(cut, price)) + geom_lv(aes(fill=..LV..)) + theme_bw() + scale_fill_lv()

The LV plot is fine (I guess), but I’d rather just use a violin plot if I want to know something about distributions.

ggplot(diamonds, aes(cut, price)) + geom_violin() + theme_bw()

  1. Compare and contrast geom_violin() with a facetted geom_histogram(), or a coloured geom_freqpoly(). What are the pros and cons of each method?
ggplot(diamonds, aes(cut, price)) + geom_violin() + theme_bw()

ggplot(diamonds, aes(price)) + geom_histogram() + theme_bw() + facet_wrap(~cut)

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500) + theme_bw()

Violin plots are great if you want to compare distributions side-by-side, as you can get a sense for central tendency as well as spread. I don’t like faceted histograms that much at all (esp. because that’s what a violin plot is, basically). The geom_freqpoly() is great for seeing relative frequencies, which are harder to see in violin plots.

  1. If you have a small dataset, it’s sometimes useful to use geom_jitter() to see the relationship between a continuous and categorical variable. The ggbeeswarm package provides a number of methods similar to geom_jitter(). List them and briefly describe what each one does.

Not going to explain these, but it’s a cool package to point out.

  1. How could you rescale the count dataset above to more clearly show the distribution of cut within colour, or colour within cut?
# cut within color
diamonds %>% 
  count(color, cut) %>%
  group_by(color) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = prop))

# color within cut
diamonds %>% 
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = prop))

  1. Use geom_tile() together with dplyr to explore how average flight delays vary by destination and month of year. What makes the plot difficult to read? How could you improve it?
flights %>%
  group_by(dest, month) %>%
  summarise(m_delay = mean(arr_delay, na.rm = TRUE)) %>%
  ggplot(mapping = aes(x = dest, y = month)) +
    geom_tile(mapping = aes(fill = m_delay))

First, there are a lot of destinations with no flights for a given month. We can change that by filtering. It’s also hard to read the continuous scale. I’ll change that by binning it.

flights %>%
  group_by(dest) %>% 
  mutate(n_flights = n()) %>%
  filter(n_flights > 15) %>%
  group_by(dest, month) %>%
  summarise(m_delay = mean(arr_delay, na.rm = TRUE)) %>%
  mutate(delay_bin = ifelse(m_delay < 0, "Early",
                            ifelse(m_delay <= 30, "0-30 min late",
                                   ifelse(m_delay <= 60, "30-60 min late", "60+ min late")))) %>%
  mutate(delay_bin = factor(delay_bin, levels = c("Early", "0-30 min late", "30-60 min late", "60+ min late"))) %>%
  ggplot(mapping = aes(x = as.factor(month), y = dest)) +
    geom_tile(mapping = aes(fill = delay_bin)) + scale_fill_brewer(direction = 1) + xlab("Month")

It would be nice to have a way to filter the relevant destinations as well.

  1. Why is it slightly better to use aes(x = color, y = cut) rather than aes(x = cut, y = color) in the example above?
diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))

diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = cut, y = color)) +
    geom_tile(mapping = aes(fill = n))

There are more colors, and it’s easier to read them along the x-axis.

  1. Instead of summarising the conditional distribution with a boxplot, you could use a frequency polygon. What do you need to consider when using cut_width() vs cut_number()? How does that impact a visualisation of the 2d distribution of carat and price?
ggplot(data = diamonds,
       mapping = aes(color = cut_number(carat, 5), x = price)) +
  geom_freqpoly() + theme_bw()

cut_number() makes a specified number of bins, each holding (roughly) the same number of observations, so bins covering sparse regions of carat can be very wide.

ggplot(data = diamonds,
       mapping = aes(color = cut_width(carat, 1), x = price)) +
  geom_freqpoly() + theme_bw()

cut_width() makes bins of a fixed width, so bins in sparse regions can hold very few observations; choose a width that gives the right resolution on your data.
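The difference is easy to see on a small, skewed toy vector:

```r
library(ggplot2)

x <- c(1, 1.2, 1.4, 2, 5, 9)

table(cut_number(x, 3))  # 3 bins with ~equal counts but unequal widths
table(cut_width(x, 2))   # fixed-width bins with very unequal counts
```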

  1. Visualise the distribution of carat, partitioned by price.
ggplot(diamonds, aes(x = price, y = carat)) + 
  geom_boxplot(mapping = aes(group = cut_width(price, 1000))) + theme_bw()

  1. How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

Very small diamonds have a much narrower price distribution than very large diamonds. I would assume that for small diamonds, there is only so much variation that quality can give – a really great tiny diamond probably costs something similar to a not-so-great tiny diamond.

  1. Combine two of the techniques you’ve learned to visualise the combined distribution of cut, carat, and price.
ggplot(diamonds, aes(x = cut_number(price,5), y = carat, fill = cut)) + 
  geom_boxplot() + theme_bw()

  1. Two dimensional plots reveal outliers that are not visible in one dimensional plots. For example, some points in the plot below have an unusual combination of x and y values, which makes the points outliers even though their x and y values appear normal when examined separately.
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

Why is a scatterplot a better display than a binned plot for this case?

ggplot(data = diamonds) +
  geom_hex(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

Since most of the data are on the same line, it’s hard to read the binned plot here. We’re looking for individual outlier points, which are hard to see on this type of plot.

Chapter 8 - Projects

General ideas:

  1. Keep everything you do in your analysis in the script. If you run things directly in the console, assume you will forget them later.

  2. Instead of using setwd() all the time, use an R project.

I think R projects are basically just directories structured in a standard way, where you use relative paths rather than absolute paths.

Chapter 10 - Tibbles

as_tibble() works like as.data.frame(), and tibble() works like data.frame(), except that tibble() doesn’t change the type of the input (e.g., string to factor).
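For example (note that since R 4.0, data.frame() no longer converts strings by default, so stringsAsFactors = TRUE is set explicitly here to show the old behavior):

```r
library(tibble)

df <- data.frame(x = letters[1:3], stringsAsFactors = TRUE)
class(df$x)
## [1] "factor"

tb <- tibble(x = letters[1:3])  # input type preserved
class(tb$x)
## [1] "character"
```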

You can use tribble() to make creating a data frame easier to read.

tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)
## # A tibble: 2 x 3
##   x         y     z
##   <chr> <dbl> <dbl>
## 1 a         2   3.6
## 2 b         1   8.5

Tibbles vs. Data Frames

Printing

By default, a tibble will only print the first 10 rows of a table and only those columns that fit on the screen. You can modify this if you want. width = Inf will print all the columns.

nycflights13::flights %>% 
  print(n = 10, width = Inf)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
##    sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
##             <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
##  1            819        11 UA        1545 N14228  EWR    IAH        227
##  2            830        20 UA        1714 N24211  LGA    IAH        227
##  3            850        33 AA        1141 N619AA  JFK    MIA        160
##  4           1022       -18 B6         725 N804JB  JFK    BQN        183
##  5            837       -25 DL         461 N668DN  LGA    ATL        116
##  6            728        12 UA        1696 N39463  EWR    ORD        150
##  7            854        19 B6         507 N516JB  EWR    FLL        158
##  8            723       -14 EV        5708 N829AS  LGA    IAD         53
##  9            846        -8 B6          79 N593JB  JFK    MCO        140
## 10            745         8 AA         301 N3ALAA  LGA    ORD        138
##    distance  hour minute time_hour          
##       <dbl> <dbl>  <dbl> <dttm>             
##  1     1400     5     15 2013-01-01 05:00:00
##  2     1416     5     29 2013-01-01 05:00:00
##  3     1089     5     40 2013-01-01 05:00:00
##  4     1576     5     45 2013-01-01 05:00:00
##  5      762     6      0 2013-01-01 06:00:00
##  6      719     5     58 2013-01-01 05:00:00
##  7     1065     6      0 2013-01-01 06:00:00
##  8      229     6      0 2013-01-01 06:00:00
##  9      944     6      0 2013-01-01 06:00:00
## 10      733     6      0 2013-01-01 06:00:00
## # ... with 3.368e+05 more rows

Exercises

  1. How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).

It will print differently – in particular, tibbles will say A tibble at the top.

  1. Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviours cause you frustration?
df <- data.frame(abc = 1, xyz = "a")
df_tibble <- tibble(abc = 1, xyz = "a")
df$x
## [1] a
## Levels: a
df_tibble$x
## NULL

In this one, the data.frame returned the closest matching column (xyz), while the tibble returned NULL (newer versions also warn). Tibbles don’t do partial matching. A partial match could hide errors – e.g., if you think you are calling column x but it was never created, so you silently get xyz instead.

class(df[, "xyz"])
## [1] "factor"
class(df_tibble[, "xyz"])
## [1] "tbl_df"     "tbl"        "data.frame"

Here, the data.frame returns a vector of a factor with one level (a), while the tibble returns a character tibble.

df[, c("abc", "xyz")]
##   abc xyz
## 1   1   a
df_tibble[, c("abc", "xyz")]
## # A tibble: 1 x 2
##     abc xyz  
##   <dbl> <chr>
## 1     1 a

In this case they return the same thing.

  1. If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?
var <- "mpg"
mtcars %>% .[[var]]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2
## [15] 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4
## [29] 15.8 19.7 15.0 21.4
  1. Practice referring to non-syntactic names in the following data frame by:
  • Extracting the variable called 1.
  • Plotting a scatterplot of 1 vs 2.
  • Creating a new column called 3 which is 2 divided by 1.
  • Renaming the columns to one, two and three.
annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)

annoying$`1`
##  [1]  1  2  3  4  5  6  7  8  9 10
annoying %>%
  ggplot(aes(`1`, `2`)) + geom_point()

annoying <- annoying %>%
  mutate(`3` = `2`/`1`)

names(annoying) <- c("one", "two", "three")
annoying
## # A tibble: 10 x 3
##      one    two three
##    <int>  <dbl> <dbl>
##  1     1  0.714 0.714
##  2     2  3.67  1.83 
##  3     3  5.89  1.96 
##  4     4  6.98  1.74 
##  5     5  9.64  1.93 
##  6     6 12.7   2.11 
##  7     7 14.4   2.06 
##  8     8 16.4   2.04 
##  9     9 18.5   2.06 
## 10    10 18.7   1.87
  1. What does tibble::enframe() do? When might you use it?

enframe() takes a named vector and turns it into a two-column data frame (name and value). It is useful for tidying the named-vector output of many base stats functions.
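A quick sketch:

```r
library(tibble)

# enframe() turns a named vector into a name/value tibble:
enframe(c(a = 1, b = 2, c = 3))
# A tibble with columns name ("a", "b", "c") and value (1, 2, 3)
```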

  1. What option controls how many additional column names are printed at the footer of a tibble?

You can use n_extra in print().

Chapter 11 - Data Import

Interesting things about read_csv()

  • You can provide an inline comma separated file.
  • skip = n lets you skip a few lines at the top of the file (good for E-Prime exports)
  • comment = "#" will skip any lines that start with #.
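The bullets above can be combined in one call; a small sketch with a made-up inline file:

```r
library(readr)

# An inline string works like a file; skip drops leading lines and
# comment drops anything after the comment character.
read_csv("metadata line to ignore
x,y
1,2
# a comment
3,4", skip = 1, comment = "#")
```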

readr vs. base R

Reasons why read_csv() from readr is better than base R’s read.csv

  • It’s faster
  • It automatically creates tibbles
  • It is more reproducible

Parsing a vector

parse_*() takes two arguments: a vector and a collection of strings to treat as NA.

To parse numbers, you can give arguments to locale – e.g., changing from 1.00 to 1,00.
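For example, with European-style numbers (comma as the decimal mark, period as the grouping mark):

```r
library(readr)

# "1,23" with a comma decimal mark parses as 1.23:
parse_double("1,23", locale = locale(decimal_mark = ","))

# parse_number() also respects the grouping mark:
parse_number("1.234,56",
             locale = locale(grouping_mark = ".", decimal_mark = ","))
```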

Sometimes it’s useful to parse factors as character vectors until you’ve finished cleaning them up.

Exercises

  1. What function would you use to read a file where fields were separated with |?

read_delim(path_to_file, delim = "|")
## Error in is.connection(x): object 'path_to_file' not found
  1. Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

col_names, col_types, locale, na, quoted_na, quote, trim_ws, n_max, guess_max, progress.

  1. What are the most important arguments to read_fwf()?

Probably file and col_positions. col_positions specifies where each fixed-width column begins and ends (usually built with fwf_widths() or fwf_positions()).

  1. Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?
read_csv("x,y\n1,'a,b'", quote = "'")
## # A tibble: 1 x 2
##       x y    
##   <int> <chr>
## 1     1 a,b
  1. Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
read_csv("a,b\n1,2,3\n4,5,6")
## # A tibble: 2 x 2
##       a     b
##   <int> <int>
## 1     1     2
## 2     4     5
read_csv("a,b,c\n1,2\n1,2,3,4")
## # A tibble: 2 x 3
##       a     b     c
##   <int> <int> <int>
## 1     1     2    NA
## 2     1     2     3

In both of these, the number of values in a row doesn’t match the number of column names: extra values are dropped (with a warning) and missing ones are filled with NA.

read_csv("a,b\n\"1")
## # A tibble: 1 x 2
##       a b    
##   <int> <chr>
## 1     1 <NA>

This one has an unclosed quote, and the value for b is missing.

read_csv("a,b\n1,2\na,b")
## # A tibble: 2 x 2
##   a     b    
##   <chr> <chr>
## 1 1     2    
## 2 a     b

This one parses without complaint, but the character values in the last row force both columns to be character – probably not what was intended.

read_csv("a;b\n1;3")
## # A tibble: 1 x 1
##   `a;b`
##   <chr>
## 1 1;3

This one needs to have the separator changed to ;.

  1. What are the most important arguments to locale()?

Probably decimal_mark and grouping_mark (for parsing numbers), encoding (for reading files in other character sets), and date_names, which you can just set to something like en or fr.

  1. What happens if you try and set decimal_mark and grouping_mark to the same character?
parse_number("123,456,789", locale = locale(grouping_mark = ",", decimal_mark = ","))
## Error: `decimal_mark` and `grouping_mark` must be different

You get an error.

What happens to the default value of grouping_mark when you set decimal_mark to ,?

parse_number("123,456,789", locale = locale(decimal_mark = ","))
## [1] 123.456

It changes to ..

What happens to the default value of decimal_mark when you set the grouping_mark to .?

parse_number("123,456,789", locale = locale(decimal_mark = "."))
## [1] 123456789

It changes to ,.

  1. I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.

It lets you set what the date or time format should be manually. E.g.,

str(parse_guess("01/02/2013", locale = locale(date_format = "%d/%m/%Y")))
##  Date[1:1], format: "2013-02-01"
  1. If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.

I live inside the US!

  1. What’s the difference between read_csv() and read_csv2()?

read_csv() uses , for the separator and . for decimals. read_csv2() uses ; for the separator and , for the decimal mark.

  1. What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.

The ISO 8859 family is commonly used in Europe (covering, e.g., Greek, Polish, and Slovenian), along with Windows code pages like Windows-1251 and Windows-1254. In Asia, Chinese text commonly uses Big5 or GB18030, Korean uses EUC-KR, and Japanese uses Shift_JIS.

  1. Generate the correct format string to parse each of the following dates and times:
d1 <- "January 1, 2010"
parse_date(d1, "%B %d, %Y")
## [1] "2010-01-01"
d2 <- "2015-Mar-07"
parse_date(d2, "%Y-%b-%d")
## [1] "2015-03-07"
d3 <- "06-Jun-2017"
parse_date(d3, "%d-%b-%Y")
## [1] "2017-06-06"
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, "%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, "%m/%d/%y")
## [1] "2014-12-30"
t1 <- "1705"
parse_time(t1, "%H%M")
## 17:05:00
t2 <- "11:15:10.12 PM"
parse_time(t2, "%I:%M:%OS %p")
## 23:15:10.12

Parsing a file

If your columns are not coming in as the right type, you can tell readr how to parse them manually.
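A minimal sketch with a made-up inline file: columns listed in cols() use the given parser, and anything not listed falls back to readr’s guesser.

```r
library(readr)

# Force the column types instead of relying on the guesses:
read_csv("x,y,z\n1,a,2010-01-01",
  col_types = cols(
    x = col_integer(),
    y = col_character(),
    z = col_date()
  ))
```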

Writing to a file

If you want to save interim results, you can use write_rds and read_rds to preserve column types.

Chapter 12 - Tidy Data

Tidy data follows a particular structure:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Missing Data

Missing data can exist explicitly (as a value of NA) or implicitly (by just not being present).

complete() takes all of the combinations of a set of columns (like subject and condition) and makes one row for each combination, filling in empty cells with NA.

fill() will replace missing cells with the most recent non-missing value (based on row index).
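A small sketch of both, using a made-up subject/condition table:

```r
library(tibble)
library(tidyr)

df <- tibble(
  subject   = c(1, 1, 2),
  condition = c("a", "b", "a"),
  score     = c(10, 12, 9)
)

# complete() turns the implicit missing combination (subject 2,
# condition "b") into an explicit NA row:
df %>% complete(subject, condition)

# fill() carries the last non-missing value downward (the default):
tibble(group = c("x", NA, NA, "y")) %>% fill(group)
```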

Exercises

  1. Using prose, describe how the variables and observations are organised in each of the sample tables.

table1: Each row has cases and population for a single year in a single country.

table2: This is a long dataset (good for plotting, sometimes). Each row has one observation (either cases or population) for a single country in a single year.

table3: Here the numeric data (rate) is stored as a character vector! This is pretty obnoxious.

table4a & table4b: Each table has one row per country with separate columns per year; table4a holds cases and table4b holds population.

  1. Compute the rate for table2, and table4a + table4b. You will need to perform four operations:
  • Extract the number of TB cases per country per year.
  • Extract the matching population per country per year.
  • Divide cases by population, and multiply by 10000.
  • Store back in the appropriate place. Which representation is easiest to work with? Which is hardest? Why?
table2 %>%
  spread(key = type, value = count) %>%
  mutate(rate = cases/population * 10000)
## # A tibble: 6 x 5
##   country      year  cases population  rate
##   <chr>       <int>  <int>      <int> <dbl>
## 1 Afghanistan  1999    745   19987071 0.373
## 2 Afghanistan  2000   2666   20595360 1.29 
## 3 Brazil       1999  37737  172006362 2.19 
## 4 Brazil       2000  80488  174504898 4.61 
## 5 China        1999 212258 1272915272 1.67 
## 6 China        2000 213766 1280428583 1.67
table4a_2 <- table4a %>%
  gather(key = year, value = cases, `1999`, `2000`)

table4b_2 <- table4b %>%
  gather(key = year, value = population, `1999`, `2000`)

table4 <- merge(table4a_2, table4b_2)

table4 %>%
  mutate(rate = cases/population * 10000)
##       country year  cases population     rate
## 1 Afghanistan 1999    745   19987071 0.372741
## 2 Afghanistan 2000   2666   20595360 1.294466
## 3      Brazil 1999  37737  172006362 2.193930
## 4      Brazil 2000  80488  174504898 4.612363
## 5       China 1999 212258 1272915272 1.667495
## 6       China 2000 213766 1280428583 1.669488

table2 is definitely easier because you can just reshape it into wide format. For the two separate tables, you have to 1) know what data they contain (population vs. cases), 2) reshape them, and 3) merge them, all before creating the rate variable.

  1. Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))

table2 %>%
  spread(key = type, value = count) %>%
  ggplot(aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))

You have to use spread() to get the right data format.

  1. Why are gather() and spread() not perfectly symmetrical? Carefully consider the following example:
stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
  )
stocks %>% 
  spread(year, return) %>% 
  gather("year", "return", `2015`:`2016`)
## # A tibble: 4 x 3
##    half year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     2 2015    0.59
## 3     1 2016    0.92
## 4     2 2016    0.17

(Hint: look at the variable types and think about column names.)

The year variable is originally numeric, but then when it becomes a column name through spread() and then a value again through gather(), it ends up as a character variable. These functions do not preserve variable type.

  1. Both spread() and gather() have a convert argument. What does it do?

The convert argument deals with the issue above: it runs type.convert() on the new columns, so character values are turned back into numeric, integer, or logical where appropriate.

  1. Why does this code fail?
table4a %>% 
  gather(1999, 2000, key = "year", value = "cases")
## Error in inds_combine(.vars, ind_list): Position must be between 0 and n

When you use numbers as column names, you have to surround them with backticks or they will be interpreted as column positions.
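With backticks added, the same call works:

```r
library(tidyr)

# Backticks make 1999 and 2000 column names rather than positions:
table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
```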

  1. Why does spreading this tibble fail? How could you add a new column to fix the problem?
people <- tribble(
  ~name,             ~key,    ~value,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
  )

There are two rows that have the same name and key (Phillip Woods, age). You could add an id column to distinguish them.
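One sketch of the fix: number the repeated (name, key) observations so every row has a unique combination to spread on (obs is a made-up column name):

```r
library(dplyr)
library(tidyr)
library(tibble)

people <- tribble(
  ~name,             ~key,    ~value,
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

# Add an observation number within each (name, key) pair, then spread:
people %>%
  group_by(name, key) %>%
  mutate(obs = row_number()) %>%
  ungroup() %>%
  spread(key = key, value = value)
```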

  1. Tidy the simple tibble below. Do you need to spread or gather it? What are the variables?
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
  )

The male and female columns should be gathered.

preg %>%
  gather(key = gender, value = number, male, female)
## # A tibble: 4 x 3
##   pregnant gender number
##   <chr>    <chr>   <dbl>
## 1 yes      male       NA
## 2 no       male       20
## 3 yes      female     10
## 4 no       female     12

The variables are pregnant, gender, and number.

  1. What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"))
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     f    
## 3 h     i     j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"))
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     <NA> 
## 3 f     g     i

extra lets you choose what to do when there are too many pieces (e.g., in row 2 of the first example). The default is to drop the extras with a warning.

fill lets you choose what to do when there are too few pieces. The default warns and fills the missing pieces with NA on the right.

  1. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

remove lets you choose if you want to keep the original column(s) or omit them. You can set it to FALSE if you want to keep the original column in the data frame.

  1. Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite?

separate() splits a single column into multiple columns based on a separator or position. extract() uses a regular expression to get capturing groups and then puts them into columns. Thus, extract() lets you look for something different with each section of the column/string.

There are lots of ways to split a string, but only one way to paste it back together!

  1. Compare and contrast the fill arguments to spread() and complete().

In spread(), fill lets you choose what to replace missing values (explicit and implicit) with.

In complete(), fill also lets you choose what to use for missing combinations (explicit and implicit), but you can set it for each variable.

  1. What does the direction argument to fill() do?

It lets you choose whether to copy entries up the data frame or down the data frame.

  1. In this case study I set na.rm = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?
table(who$new_ep_f5564)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
##  253   90   53   40   31   24   32   21   12   19   17   12   10   12    4 
##   15   16   17   18   19   20   21   22   23   24   25   26   28   29   30 
##    7    8    5    7    6    5    4    2    5    3    4    4    8    6    7 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##    3    5    5    2    7    5    4    3    4   10    3    4    8    2    1 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##   11    1    2    2    5    2    2    5    2    3    3    5    2    3    2 
##   61   62   63   64   65   69   70   73   74   75   76   78   80   82   83 
##    3    2    2    2    3    1    1    1    1    2    1    2    2    3    2 
##   84   87   91   92   93   95  101  102  106  107  108  109  110  111  112 
##    2    1    1    3    1    1    1    1    2    1    1    1    1    1    2 
##  113  115  117  118  120  121  123  124  125  126  127  128  129  131  132 
##    2    1    2    2    1    2    1    2    1    1    1    1    2    1    2 
##  134  137  138  139  140  141  142  143  144  145  146  148  149  150  151 
##    2    1    1    1    1    1    3    1    1    1    1    1    1    1    1 
##  152  155  156  161  162  163  164  170  172  173  174  176  181  182  183 
##    3    2    2    1    2    1    1    1    1    2    2    1    1    1    1 
##  186  187  188  192  195  200  207  209  231  238  244  248  253  256  266 
##    1    1    1    1    1    1    1    1    1    1    1    2    1    1    1 
##  271  272  283  287  301  302  307  317  339  348  357  360  361  374  379 
##    2    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
##  380  384  385  386  387  388  393  396  397  399  409  415  420  421  435 
##    1    1    1    1    1    1    1    1    2    1    1    2    3    1    2 
##  442  463  475  483  493  513  520  563  571  578  582  584  614  620  628 
##    1    1    2    1    1    1    1    1    1    1    1    1    1    1    1 
##  631  702  849  901  969  992 1118 1186 1498 1726 1894 4684 
##    1    1    1    1    1    1    1    1    1    1    1    1
table(is.na(who$new_ep_f5564))
## 
## FALSE  TRUE 
##  1017  6223

In this dataset, missing values appear to be truly missing (not reported), so missingness is mostly explicit. table() shows that 0 is an entered value, so an NA (not reported) means something different from a zero (reported as zero cases).

  1. What happens if you neglect the mutate() step? (mutate(key = stringr::str_replace(key, "newrel", "new_rel")))
who %>%
  gather(key, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
## # A tibble: 76,046 x 6
##    country      year var   sex   age   value
##    <chr>       <int> <chr> <chr> <chr> <int>
##  1 Afghanistan  1997 sp    m     014       0
##  2 Afghanistan  1998 sp    m     014      30
##  3 Afghanistan  1999 sp    m     014       8
##  4 Afghanistan  2000 sp    m     014      52
##  5 Afghanistan  2001 sp    m     014     129
##  6 Afghanistan  2002 sp    m     014      90
##  7 Afghanistan  2003 sp    m     014     127
##  8 Afghanistan  2004 sp    m     014     139
##  9 Afghanistan  2005 sp    m     014     151
## 10 Afghanistan  2006 sp    m     014     193
## # ... with 76,036 more rows

You will have values of key that mean the same thing but are counted as different variables because there are slight variations in how they were named.

  1. I claimed that iso2 and iso3 were redundant with country. Confirm this claim.
who %>%
  group_by(country, iso2, iso3) %>%
  summarise(freq = n()) %>% 
  group_by(country) %>%
  summarise(freq2 = n()) %>%
  filter(freq2 > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: country <chr>, freq2 <int>

No country has more than one combination of country, iso2 and iso3.

  1. For each country, year, and sex compute the total number of cases of TB. Make an informative visualisation of the data.
who %>%
  gather(key, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>% 
  mutate(key = stringr::str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1) %>%
  group_by(country, year, sex) %>%
  summarise(cases = sum(value)) %>%
  ggplot(aes(year, cases)) + geom_point() + facet_wrap(~sex) + theme_bw()

It doesn’t really make sense to plot all 200+ countries. Perhaps if I had region information I could color-code the points on the plot. Ideally, there would be lines connecting the points from a given region to give a sense of change over time.

Chapter 13 - Relational Data

Relational data takes place over multiple tables.

Mutating joins: add new variables to one data frame from matching observations in another.

Filtering joins: filter observations from one data frame based on whether or not they match an observation in the other table.

Set operations: treat observations as if they were set elements.
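The difference between mutating and filtering joins can be sketched with a pair of toy tables (the names here are made up for illustration):

```r
library(dplyr)
library(tibble)

band <- tibble(name = c("Mick", "John", "Paul"),
               band = c("Stones", "Beatles", "Beatles"))
instrument <- tibble(name  = c("John", "Paul", "Keith"),
                     plays = c("guitar", "bass", "guitar"))

# Mutating join: adds the plays column (NA where there's no match).
band %>% left_join(instrument, by = "name")

# Filtering join: keeps only rows of band with a match; adds no columns.
band %>% semi_join(instrument, by = "name")
```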

Exercises

  1. Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?

You’d need to join airports on flights to get the lat/long of the airports.
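One way to sketch it: join airports twice, once for the origin coordinates and once for the destination coordinates (the renamed columns like origin_lat are made up for clarity):

```r
library(nycflights13)
library(dplyr)

flights %>%
  select(origin, dest) %>%
  left_join(select(airports, faa, origin_lat = lat, origin_lon = lon),
            by = c("origin" = "faa")) %>%
  left_join(select(airports, faa, dest_lat = lat, dest_lon = lon),
            by = c("dest" = "faa"))
```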

  1. I forgot to draw the relationship between weather and airports. What is the relationship and how should it appear in the diagram?

origin in weather = faa in airports.

  1. weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights?

You would be able to join weather on flights either by origin or destination, depending on which data you wanted.

  1. We know that some days of the year are “special”, and fewer people than usual fly on them. How might you represent that data as a data frame? What would be the primary keys of that table? How would it connect to the existing tables?
flights %>%
  left_join(select(planes, tailnum, seats)) %>%
  group_by(month,day) %>%
  summarise(n_seats = sum(seats, na.rm = TRUE)) %>%
  arrange(n_seats)
## # A tibble: 365 x 3
## # Groups:   month [12]
##    month   day n_seats
##    <int> <int>   <int>
##  1     2     9   59809
##  2    11    28   74750
##  3    12    14   75186
##  4    11    29   78080
##  5     2     2   78490
##  6     1    26   79846
##  7     1    19   79947
##  8    12     7   80110
##  9     1    12   80170
## 10     9     7   82605
## # ... with 355 more rows

This data frame shows the total number of seats available in all of the flights departing on a given day from NYC airports. The fewest seats available are shown at the top of the dataset. The primary key is the combination of month and day, which allows you to join it (not so well) to other tables.

We could also make a data frame of days that people might not travel often (Christmas, New Year’s, etc.), but the data shows that this is not really the case. However, it might be useful if we want to filter those days (or other holiday travel) out of the other data frames.

  1. Add a surrogate key to flights.
flights %>%
  mutate(id = row_number())
## # A tibble: 336,776 x 20
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, id <int>
  1. Identify the keys in the following datasets
library(Lahman)
library(babynames)
library(nasaweather)
library(fueleconomy)
  • Lahman::Batting
Lahman::Batting %>%
  count(playerID, yearID, stint) %>%
  filter(n > 1)
## # A tibble: 0 x 4
## # ... with 4 variables: playerID <chr>, yearID <int>, stint <int>, n <int>

Combo of playerID, yearID and stint.

  • babynames::babynames
babynames::babynames %>%
  count(year, sex, name) %>%
  filter(nn > 1)
## # A tibble: 0 x 4
## # ... with 4 variables: year <dbl>, sex <chr>, name <chr>, nn <int>

Combo of year, sex, and name.

  • nasaweather::atmos
nasaweather::atmos %>%
  count(lat, long, year, month) %>%
  filter(n > 1)
## # A tibble: 0 x 5
## # ... with 5 variables: lat <dbl>, long <dbl>, year <int>, month <int>,
## #   n <int>

Combo of lat, long, year, and month.

  • fueleconomy::vehicles
fueleconomy::vehicles %>%
  count(id) %>%
  filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: id <int>, n <int>

Just id!

  • ggplot2::diamonds

I don’t think there is one for this table!

(You might need to install some packages and read some documentation.)

  1. Draw a diagram illustrating the connections between the Batting, Master, and Salaries tables in the Lahman package.

playerID is common to Batting, Master, and Salaries. To combine Batting and Salaries, you should use yearID as well.

Draw another diagram that shows the relationship between Master, Managers, AwardsManagers.

To join Managers and AwardsManagers, you can use playerID and yearID, probably filtering by plyrMgr == "Y". Then you can join in Master by playerID.

How would you characterise the relationship between the Batting, Pitching, and Fielding tables?

They have the same primary key – a combo of playerID, yearID, and stint.